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High-dimensional classification has become an increasingly im- 
portant problem. In this paper we propose a "Multivariate Adap- 
tive Stochastic Search" (MASS) approach which first reduces the di- 
mension of the data space and then applies a standard classification 
method to the reduced space. One key advantage of MASS is that it 
automatically adjusts to mimic variable selection type methods, such 
as the Lasso, variable combination methods, such as PCA, or meth- 
ods that combine these two approaches. The adaptivity of MASS 
allows it to perform well in situations where pure variable selection 
or variable combination methods fail. Another major advantage of 
our approach is that MASS can accurately project the data into very 
low-dimensional non-linear, as well as linear, spaces. MASS uses a 
stochastic search algorithm to select a handful of optimal projection 
directions from a large number of random directions in each iteration. 
We provide some theoretical justification for MASS and demonstrate 
its strengths on an extensive range of simulation studies and real 
world data sets by comparing it to many classical and modern clas- 
sification methods. 

1. Introduction. An increasingly important topic is the classification of 
observations into two or more predefined groups when the number of predic- 
tors, d, is larger than the number of observations, n. For example, one may 
need to identify at what time a subject is performing a specific task based 
on hundreds of thousands of brain voxel values in a functional Magnetic 
Resonance Imaging (fMRI) study, where changes in blood flow and blood 
oxygenation are measured when brain neurons are activated. In particular, 
the data set that motivated the development of the methodology in this 
paper was an fMRI study based on vision research. We wish to predict, at 
a given time, whether a subject is conducting a task or resting (baseline), 
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based on the activity level of the observed voxels in the subject's brain. 
This is a difficult task because the number of voxels, d, is much higher than 
the number of observations, n. In this case, many conventional classification 
methods, such as Fisher's discriminant rule, are not applicable since n < d. 
Other methods, such as classification trees, fc-nearest neighbors and logistic 
discriminant analysis, do not explicitly require n> d, but in practice provide 
poor classification accuracy in this situation. 

A common solution is to first reduce the data into a lower, p d di- 
mensional space and then perform classification on the transformed data. 
For example, Fan and Lv (2008) provide significant theoretical and empir- 
ical evidence for the power of such an approach. Generally, the dimension 
reduction is performed using a linear transformation of the form 

(1) Z = XA, 

where X is an n-by-ci data matrix and A is a d-hy-p transformation ma- 
trix which projects X onto a p-dimensional subspace Z (p < d). However, 
there are many possible approaches to choosing A. We divide the methods 
into supervised versus unsupervised and variable selection versus variable 
combination. 

Variable selection techniques select a subset of relevant variables that have 
good predictive power, thus obtaining a subset of informative variables from 
a set of more complex variables. In this setting, A is some row permutation of 
the identity matrix and the zero matrix, that is, A = perm([I p , 0(d-p)xp] T )- 
If we define sparsity as the fraction of zero elements in a given matrix, then A 
would be considered sparse because most of its components are zeros. Many 
variable selection methods have been proposed and widely used in numerous 
areas. A great deal of attention is paid to the L\ penalized least squares esti- 
mator [i.e., the Lasso Tibshirani (1996), Efron et al. (2004)]. Other methods 
include SCAD Fan and Li (2001), nearest shrunken centroids Tibshirani et 
at. (2002), the Elastic Net Zou and Hastie (2005), Dantzig selector Candes 
and Tao (2007), VISA Radchenko and James (2008), FLASH Radchenko 
and James (2009) and Bayesian methods of variable selection Mitchell and 
Beauchamp (1988), George and McCulloch (1993, 1997). These methods all 
use supervised learning, where the response and predictors are both utilized 
to obtain the subset of variables. In addition, because these methods always 
select a subset of the original variables, they provide highly interpretable 
results. However, because of the sparsity of A, for a given p, variable se- 
lection methods are less efficient at compressing the observed data, X. For 
example, they may discard potentially valuable variables which are not pre- 
dictive individually but provide significant improvement in conjunction with 
others. 

In comparison, variable combination methods utilize a dense A which 
combines correlated variables and hence does well on multicollinearities 



MULTIVARIATE ADAPTIVE STOCHASTIC SEARCH 



3 



which often occur in high-dimensional data. Probably the most commonly 
applied method in this category is principal component analysis (PCA). Us- 
ing this approach, A becomes the first p eigenvectors of X, and Z is the 
associated PCA scores. PCA can deal with an ultra large data scale and 
produces the most efficient representation of X using p dimensions. How- 
ever, PCA is an unsupervised learning technique which does not make use of 
the response variable to construct A. It is well known that the dimensions 
that best explain X will not necessarily be the best dimensions for predicting 
the response Y. Other variable combination methods include partial least 
squares regression and multidimensional scaling. All these approaches yield 
linear combinations of variables which makes interpretation more difficult. 

In this paper we propose a new supervised learning method which we call 
"Multivariate Adaptive Stochastic Search" (MASS). Our approach works 
by projecting a high dimensional data set into a lower dimensional space 
and then applying a classifier to the projected data. However, MASS has 
two key advantages over these other methods. First, when using a linear 
projection, such as given by (1), MASS uses a stochastic search process that 
is capable of automatically adapting the sparsity of A to generate opti- 
mal prediction accuracy. Hence, in situations where a subset of the original 
variables provides a good fit, MASS will utilize a sparse model, while in sit- 
uations where linear combinations of the variables work better, MASS will 
produce a denser model. MASS has the same advantage as other supervised 
methods in which A can be designed specifically to provide the best level 
of prediction accuracy. However, it has the flexibility to adapt the sparsity 
of A so as to gain the benefits of both the variable selection and variable 
combination methods. The second major advantage of MASS is that, with 
only a small adaption to the standard fitting procedure, it can project the 
original data into a nonlinear space. This generalization of (1) potentially 
allows for a very accurate projection into a low-dimensional space with little 
additional effort. 

MASS starts by generating a large set of prospective columns for A with 
a given sparsity level. A variable selection technique such as the Lasso is 
then applied to select a candidate subset of "good" columns or directions. 
Then, a new set of columns with a new sparsity level are produced as can- 
didates. The variable selection method must choose among the current set 
of good columns and the new candidates. Over time the same best columns 
are picked at each iteration and the process converges. At each step in the 
algorithm, the sparsity level of the new prospective columns is adjusted ac- 
cording to the sparsity level of the previously chosen columns. We show 
through extensive simulations and real world examples that MASS is highly 
robust in that it generally provides comparable performance to variable se- 
lection and variable combination approaches in situations that favor each of 
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these methods. However, MASS can still perform well in situations where 
one or another of these approaches fails. 

This paper is organized as follows. In Section 2 we present the basic MASS 
methodology. After outlining the linear fitting algorithm, we show how this 
can easily be extended to the nonlinear generalization. Furthermore, we pro- 
vide some theoretical motivation for MASS and discuss a preliminary data 
reduction which can be implemented before applying MASS. In Section 3 
we demonstrate the performance of MASS on an fMRI study and a gene mi- 
croarray study. In Section 4 we further study the performance of MASS on 
different scenarios by comparing MASS with several other modern potential 
classification techniques, such as fc-nearest neighbors, support vector ma- 
chines, random forests and neural networks, in extensive simulation studies. 
We briefly investigate some issues in implementing MASS, such as solution 
variability, computational cost and the problem of overfitting, in Section 5. 
A brief discussion summarizes the paper in Section 6. 



2. Projection selection with MASS. 



2.1. General ideas and motivations. Given predictors, Xj €M rf , and cor- 
responding categorical responses, yi, we model the relationship between yi 
and Xj as 

(2) yi|xj ~p(y*|zj), i = l,...,n, 

(3) z it j = fj(xi), j = l,...,p, 

where Zj = (zj.i, . . . , Zi :P ) for some p^id. Our general approach is to estimate 
the /j's, project the data into a lower p-dimensional sub-space, Zj, and then 
apply a standard classification method to fit (2). 

To make this problem tractable, we further assume that fj has an additive 
structure, fj{^-i) = Ylk=ifj,k{ x i,k), and, hence, (3) becomes 

d 

( 4 ) Zi,j=^2fj,k( X i,k)- 

k=l 

For any given Zij and Xj^'s, there are many functions, fj ^, that satisfy (4). 
However, to solve Equation (4), we constrain the flexibility of these functions 
by imposing the constraints, /^(O) = 0, and 



(5) J f>[ k (x) 2 dx<\, j = l,...,p,k = l,...,d,X<0. 

It should be noted that Equation (5) can trivially be made to hold by rescal- 
ing fj To prevent this occurring, we impose a further constraint on the 
first derivative of the /j,fc's, when /j 5 fc's are nonlinear; details are proved in 
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Appendix B. Using equation (5), small values of A constrain fj^ to be close 
to a linear function. In particular, setting A = implies fj,k( x ) = a j,kX, in 
which case (3) reduces to the linear projection given by (1). We first describe 
the linear MASS approach for fitting (2) and (4) subject to A = and then 
in Section 2.4 show how the procedure can easily be extended to the more 
general nonlinear case when A > 0. 

In the linear situation fitting our model given by (2) through (4) requires 
choosing the fj,k's, or equivalently estimating A in (1), and also selecting 
a classifier to apply to the lower dimensional data. We place most atten- 
tion on the former problem because there are many classification techniques 
that have been demonstrated to perform well on low-to-medium dimensional 
data. A more difficult question involves the best way to produce the lower 
dimensional data. Hence, we assume that one of these classification meth- 
ods has been chosen and concentrate our attention on the choice of A. This 
choice can be formulated as the following optimization problem: 

(6) A = argmin£ x ,y[e(A4A(X) J Y)] 

A 

subject to \\a.j \\ = 1, 

where Ma is the classification method applied to the lower dimensional 
data, X and Y are the predictors and response variables, and e is a loss 
function resulting from using Ma to predict Y. The constraint is that each 
column of A should be norm 1. A common choice for e is the 0-1 loss 
function which results in minimizing the misclassification rate (MCR). Since 
A is high dimensional, solving (6) is in general a difficult problem. 

There are several possible approaches to optimize (6). One option is to 
assume that A is a sparse 0, 1 matrix and only attempt to estimate the 
locations of its non-zero elements. This is the approach taken by variable 
selection methods. Another option is to assume A is dense but, instead of 
choosing A to optimize (6), select a matrix which provides a good represen- 
tation for X. This is the approach taken by PC A. One then hopes that the 
PCA solution will be close to that of (6). 

Instead of making restrictive assumptions about A, as with the variable 
selection approach, or failing to directly optimize (6), as with the PCA ap- 
proach, we attempt to directly fit (6) without placing restrictions on the 
form of A. In this type of high dimensional nonlinear optimization prob- 
lem, stochastic search methods, such as genetic algorithms and simulated 
annealing, have been shown to provide superior results over more tradi- 
tional deterministic procedures because they are often able to more effec- 
tively search large parameter spaces, can be used for any class of objective 
functions and yield an asymptotic guarantee of convergence Gosavi (2003), 
Liberti and Kucherenko (2005). We explore a stochastic search process and 
demonstrate that it is highly effective at searching the parameter space and 
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generally requires significantly fewer iterations than other possible stochastic 
approaches Tian, Wilcox and James (2009). 

2.2. The MASS method. The linear MASS procedure works by succes- 
sively generating a large number, L, of potential random directions, that 
is, SLj's. We then use Equation (1) to compute the corresponding L dimen- 
sional data space, z\, Z2, . . . , zl, and use a variable selection method to select 
a "good" subset of these directions to form an initial estimate, A*. The spar- 
sity level of this A* is examined and a new random set of potential columns 
is generated with the same average sparsity as the current A*. The proce- 
dure iterates in this fashion for a fixed number of steps. Formally, the MASS 
procedure consists of the following steps: 

Step 1. Randomly generate the initial d-by-L transformation matrix A*(°) 
(p < L < d) with an expected sparsity of . 

Step 2. At the Ith iteration, use Equation (1) and A*^ to obtain a prelim- 
inary reduced data space Z*0. Evaluate each variable of Z*^ by 
fitting a model Y ~ Z, and select the p most "important" variables 
in terms of the model. 

Step 3. Keep the corresponding p columns of A*^ and calculate the average 
sparsity, for these columns. 

Step 4. Generate L — p new columns with an average sparsity of £w . Join 
these columns with the p columns selected in Step 3 to form a new 
transformation matrix A*( l+1 \ 

Step 5. Return to Step 2 for a fixed number of iterations. 

Implementing this approach requires the choice of a variable selection 
procedure for Step 2. Potentially any of a large range of standard methods 
can be chosen. We discuss various options in Section 2.6. 

A more crucial part of implementing MASS is the mechanism for gener- 
ating the potential columns for A*. We define the current level of sparsity, 
as the fraction of zero elements in A^. Then at each iteration of MASS 
we generate new potential columns with the same average sparsity level as 
the columns selected in the previous step. This allows ^ to automatically 
adjust to the data set. The idea is that a data set requiring high ^ will tend 
to result in sparser columns being selected and the current A will become 
sparser at each iteration. Alternatively, a data set that requires a denser 
A will select dense columns at each iteration. While at each step of the 
stochastic search the overall average sparsity is restricted to equal we 
desire some variability in the sparsity levels so that MASS is able to select 
out columns with higher or lower sparsity and hence adjust ^ for the next 
iteration. To achieve this goal, we allow for different average sparsity levels 
between columns. 
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In particular, we generate the (k,j)th element of A* using 

(7) a k j = u k jv k ,j, k = l,...,d,j =p + l,...,L, 

(8) u fcj ~A/"(0,l), v htj ~B(l-£j), 

where B(ir) is the Bernoulli distribution with probability of 1 equal to ix. In 
the (I + l)th iteration, we let 



(9) ~ Beta (a, ^p^) , j=p + l,...,L, 

where is the sparsity of the jth column of A*^ +1 ) . Note that E^j 1 ' ) = 
£w for all values of a. We found a = 5 produced a reasonable amount of 
variance in sparsity levels. 

We then combine these L — p columns with the selected p columns from 
iteration / to form the new intermediate transformation matrix A*(' +1 ). We 
select = 0.5 for the initial sparsity level, which seems to provide a rea- 
sonable compromise between the variable selection and variable combination 
paradigms. 

The full MASS algorithm is explicitly described as follows: 

1. Set £W =0.5 and generate an initial A*W using (7) through (9). Calcu- 
late Z*<°) by Equation (1). 

2. Select p variables from Z*( ' and keep the corresponding p columns from 
A*(°) to obtain A^. 

3. Iterate until I = I. 

(a) Generate L — p new directions A* cw by using (7) through (9). 

(b) Let A*W = (A^" 1 ), A* cw ) and use Equation (1) to obtain Z*W. 

(c) Select p variables from Z*^. 

(d) Keep the corresponding p columns from A*^ to obtain A^ . 

(e) Calculate £W for A«. 

(f) Go to (a). 

4. Apply the selected classification technique to the final Z^). 



2.3. Theoretical justification. Here we show that MASS will asymptoti- 
cally select the correct sub-space, provided a "reasonable" variable selection 
method is utilized in Step 3(c). Assumption 1 below formally defines reason- 
able. Suppose our variable selection method must choose among z±, . . . ,Z£ 
potential variables. Let Zo € W nxp represent the p-dimensional set of true 
variables. Note Zq is not necessarily a subset of zi, . . . , zx,. Define Z n 6 M. nxp 
as the p variables among z±, Z2, • ■ • , zl that minimize ||Z„ — Zo|| 2 for a sam- 
ple of size n. Then we assume that the variable selection method chosen for 
MASS satisfies the following property: 
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Assumption 1. There exists some e > such that, provided 

(10) -||Z n -Z || 2 <£, 

n 

then Z„ is chosen by the variable selection method almost surely as n — > oo. 

Assumption 1 is a natural extension of the definition of consistency of a 
variable selection method, namely, that it asymptotically selects the correct 
model. Here we extend this idea slightly by assuming that, provided a set 
of candidate predictors that is arbitrarily close to the true predictors is 
presented, the variable selection method will asymptotically choose these 
variables. Theorem 1 provides some theoretical justification for the MASS 
methodology. 

Theorem 1. Let Z^ represent the p variables selected by MASS after 
performing I iterations on a sample of size n. Then, under the linear sub- 
space model given by (1) and (2), provided a variable selection method is 
chosen such that Assumption 1 holds, as n and I approach infinity, 

-IIZ^ -Z || 2 ^0 a.s. 
n 

The proof of Theorem 1 is given in Appendix A. Theorem 1 guarantees 
that, provided a reasonable variable selection method is chosen, then the 
sub-space chosen by MASS will converge to the "true" sub-space, in terms 
of mean squared error, as n and / converge to infinity. Note, Theorem 1 does 
not assume that (10) holds; Only that, if it does hold, then asymptotically 
Z n will be selected. 

2.4. The nonlinear generalization. Recently, nonlinear reduction work 
has mostly concentrated on local neighborhood methods such as Isomap 
Tenenbaum, de Silva and Langford (2000) and LLE Roweis and Saul (2000). 
One limitation of these approaches is that they are clustering based and 
hence are unsupervised methods. Another limitation is that these approaches 
only consider local feature spaces. They perform well when the data belong 
to a single well sampled cluster, but fail when the points are spread among 
multiple clusters. MASS can also produce a nonlinear reduction without the 
aforementioned problems. 

Recall that MASS attempts to compute the fjk s subject to (5) holding for 
some A > 0. It is not hard to show that, among all functions that interpolate 
a given set of points, the one that minimizes the integrated squared second 
derivative will always be a natural cubic spline Reinsch (1967). Hence, since 
we wish to choose a set of functions that reproduce the z^fs subject to 
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(5), it seems sensible to model each fjk using a ^-dimensional natural cubic 
spline (NCS) basis, b(i). 

Using this formulation, (4) becomes zy = Ylt=i fj,k( x i,k) = Ylt=i W x «,fc) T x 
9j t k, where Oj^ represents the basis coefficients for fj^. Since the Zi j s are 
just linear functions of the 0's, we can rewrite (4) in the simpler linear form, 

a), 

(ii) z = x*e, 

where X* = (B(xi)|B(x 2 )| ■ ■ ■ \B(x d )) 6 R nx (9 x<i ), Q = (0 ls . . . , P ) G R(<? xd ) x P, 
Oj = (ej A , . . . , 9l d ) T , and B(xfc) = (b(xi, fc ), . . . , b(x n , fc )) T . 

The only complication in using the standard linear MASS methodology 
to fit (11) is ensuring that (5) holds. However, in Appendix B we show that 
some minor adaptations to (8) ensure that (5) holds for all candidate 0's that 
we generate. This is one of the advantages of the stochastic search process — it 
is easy to search only the feasible values of the space. In all other respects, 
the MASS methodology as outlined in Section 2.2 can be applied without 
any alterations to estimate a non-linear sub-space of the data. It should also 
be noted that Theorem 1 can be extended to the nonlinear setting, provided 
we assume that the true nonlinear sub-space satisfies (5). 

2.5. Preliminary reduction. MASS can, in general, be applied to any 
data. However, Fan and Lv (2008) argue that a successful strategy to deal 
with ultra-high dimensional data is to apply a series of dimension reduc- 
tions. In our setting this strategy would involve an initial reduction of the 
dimension to a "moderate" level followed by applying MASS to the new 
lower dimensional data. This two stage approach potentially has two major 
advantages. First, Fan and Lv (2008) show that prediction accuracy can be 
considerably improved by removing dimensions that clearly appear to have 
no relationship to the response. Our own simulations reinforce this notion. 
Second, stochastic search algorithms such as MASS can require significant 
computational expense. Reducing the data dimension before applying MASS 
provides a large increase in efficiency. 

In this paper we consider three methods for reducing the data into m 
[m > p) dimensions. The first approach is conventional PCA, which has the 
effect of selecting the best m dimensional space in terms of minimizing the 
mean squared deviation with the original d dimensional space. The second 
approach is the sure independence screening (SIS) method based on correla- 
tion learning Fan and Lv (2008). It computes the componentwise marginal 
correlations between each predictor and the response vector. One then selects 
the variables corresponding to the m largest correlations. PCA is an unsu- 
pervised variable combination approach, while SIS is a supervised variable 
selection method. PCA may exclude important information among variables 
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with less variability, while SIS may tend to select a redundant subset of pre- 
dictors that have high correlations with the response individually but are 
also highly correlated among themselves. Our third dimension reduction ap- 
proach, which we call PCA-SIS, attempts to leverage the best of both PCA 
and SIS by first using PCA to obtain n orthogonal components and then 
treating these components as the predictors and using SIS to select the best 
m in terms of correlation with the response. PCA-SIS can be thought of as 
a supervised version of PCA. We compare these three types of preliminary 
reduction methods, along with the effect of performing no initial dimension 
reduction, in our simulation studies. 

Fan and Lv (2008) argue that the dimension of the intermediate space m 
should be chosen as 

, , 2n 
12 m =—- 

\og(n) 

As we have found that this approach generally works well, we have adopted 
Equation (12) for selecting m in this paper. 

2.6. Implementation issues. MASS requires the choice of a variable selec- 
tion technique. In principle, any variable selection technique can be applied 
here. We considered several possible methods. In the context of classifica- 
tion, a natural approach is to use a GLM version of the Lasso using a logis- 
tic regression framework. We examined this approach using the GLMpath 
methodology of Park and Hastie (2007). Interestingly, we found that simply 
using the standard Lasso procedure to select the variables gave similar levels 
of accuracy to GLMpath and required considerably less computational ef- 
fort. Therefore, in our implementation of MASS we use the first p variables 
selected by the Lars algorithm Efron et al. (2004). However, in practice, 
one could implement our methodology with any standard variable selection 
method such as SCAD, Dantzig selector, etc. There is an extensive litera- 
ture examining the circumstances under which the Lasso will asymptotically 
select the correct model [see, e.g., Knight and Fu (2000), Tsybakov and van 
de Geer (2005)]. Hence, it seems reasonable to suppose that Assumption 1 
in Section 2.3 will hold. 

To implement MASS, one must choose values for the number of iterations, 
I, the number of random columns to generate, L, and the final number of 
columns chosen, p. In our experiments we used / = 500 iterations. We found 
this value guaranteed a good result and often fewer iterations were required. 
In general, tracking the deviance of the Lasso at each iteration provided a 
reliable measure of convergence. The value of L influences the convergence 
speed of the algorithm as well as the execution time. We found that the 
best results were obtained by choosing L to be a relatively large value for 
the early iterations but to have it decline over iterations. In particular, we 
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set L = n/2 for the first iteration and had it decline to 2p by the final 
iteration. This approach is similar in spirit to simulated annealing where 
the temperature is lowered over time. It guarantees that the Lasso has more 
variables to choose from at the early stages, but then as the search moves 
toward the optimum, decreasing L will improve the reliability of the selected 
p variables in addition to speeding up the algorithm. 

The choice of p can obviously have an important impact on the results. In 
general, p should be chosen as some balance of the classification accuracy and 
the computational expense. One reasonable approach is to use an objective 
criterion to choose p, such as cross-validation (CV). In other situations, 
some prior knowledge can be applied to select p. For example, in the fMRI 
study that we examine in Section 3, prior knowledge and assumptions on 
the regions of interest can be used to decide on a reasonable value for p. 

3. Applications. In this section we apply linear MASS to two data sets 
from real studies: an fMRI data set and a gene microarray data set. As 
a comparison to MASS, we also apply classic logistic regression (LR) or 
support vector machine (SVM) to the lower dimensional data produced using 
a straight Lasso method (Lars), a generalized Lasso method (GLMpath), SIS 
and PCA. They all utilize equation (1) but compute A directly using their 
own methodologies. In both studies, we use 500 iterations for MASS. 

3.1. fMRI brain imaging data. The fMRI data was obtained from the 
imaging center at the University of Southern California. The raw data con- 
sisted of 200 3-D brain images recording the blood oxygen level dependent 
(BOLD) response for a subject who was conducting a visual task. After pre- 
processing, each image contained 11,838 voxels. One research question was 
to divide the 200 images into task (96) and baseline (104) images based on 
the 11,838 voxels. To answer this question, we randomly divided the data 
to training (150) and test (50) samples. Since d>n, a preliminary reduc- 
tion was needed. The intermediate scale was m = 60 by Equation (12). We 
tested p= 10,20,30,40,50,60, which were considered to be good balances 
of the execution time and the classification accuracy. SVM was used as the 
base classifier. 

Table 1 reports the minimum test MCR, MCR*, and its corresponding 
p, p* , for each method. For each p, we applied MASS twenty times on the 
training data and obtained the average MCR on the test data. We then 
reported the minimum average test MCR, and its corresponding p. In this 
table as well as in the rest of the paper, SIS-MASS means that we first used 
SIS to reduce the data and then applied MASS. Similarly, PCA-SIS-Lars 
means that PCA-SIS was the first reduction method and then Lars was 
applied, etc. 
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Table 1 

Test MCR* and p* on fMRI data using SVM 



Methods 


Test MCR* 


Optimal p* 


Lars 


0.20 


50 


GLMpath 


0.22 


50 


SIS 


0.40 


30 


PCA 


0.26 


40 


SIS-Lars 


0.36 


30 


PCA-Lars 


0.22 


40 


PCA-SIS-Lars 


0.12 


50 


SIS-GlMpath 


0.38 


30 


PCA-GLMpath 


0.22 


40 


PCA-SIS-GLMpath 


0.10 


50 


SIS-MASS 


0.373 (0.008) 


50 


PCA-MASS 


0.155 (0.005) 


40 


PCA-SIS-MASS 


0.042 (0.005) 


50 



Obviously, PCA-SIS-MASS dominates other methods. We show the train- 
ing deviance, the average test MCR and the average sparsity as some func- 
tion of the number of iterations in Figure 1. As we can see, the training 
deviance and the average test MCR decline rapidly and then level off, indi- 
cating the model improves quickly. The average £ path for PCA-SIS-MASS 
indicates MASS chose a relatively sparse matrix as the optimal transforma- 
tion matrix. We also examined the relative performance of MASS in com- 
parison to a representative sampling of some modern classification methods 
[Support Vector Machines (SVM), fc-Nearest Neighbors (kNN), Neural Net- 
works (NN) and Random Forests (RF)] on the m-dimensional preliminary 
reduced data. As shown in Table 2, all four methods were inferior to MASS. 

3.2. Leukemia cancer gene microarray data. The second real-world data 
set was the leukemia cancer data from a gene microarray study by Golub et 
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Fig. 1. Preliminary reduction by PCA-SIS-MASS on fMRI data (p = 30). 
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Table 2 

Test MCR on the preliminary reduced data by different classifiers 





SVM 


kNN 


NN 


RF 


SIS- 


0.36 


0.40 


0.44 


0.34 


PC A- 


0.26 


0.26 


0.38 


0.22 


PCA-SIS- 


0.24 


0.20 


0.36 


0.20 



al. (1999). This data set contained 72 tissue samples, each with 7129 gene 
expression measurements and a cancer type label. Among the 72 tissues, 
25 were samples of acute myeloid leukemia (AML) and 47 were samples 
of acute lymphoblast leukemia (ALL). Within the 38 training samples, 27 
were ALL and 11 AML. Within the 34 test samples, 20 were ALL and 14 
AML. In some previous studies Fan and Fan (2008), Fan and Lv (2008), 
16 genes were ultimately chosen. In this study, we also picked p = 16 genes 
for MASS. However, for the other counterpart methods, we examined all 
possible p's, that is, p = 1, 2, . . . , 21, to obtain MCR* and their corresponding 
p*'s. This is considered to be an advantage for the counterpart methods and 
a disadvantage for MASS. The intermediate dimension was chosen to be 
m = 21 by Equation (12). A LR model was applied for classification. Similar 
to the fMRI study, PCA-SIS-MASS provided the lowest MCR (see Table 3), 
and it was better than the nearest shrunken centroids method mentioned in 
Tibshirani et at. (2002), which obtained a MCR of 2/34 = 0.059 based on 
21 selected genes Fan and Lv (2008). Figure 2 shows the resulting graphs. 
The sparsity of the transformation matrix leveled off at about 0.41. 

4. Simulation studies. In this section we further investigate the perfor- 
mance of MASS under different conditions. We give five simulation examples. 
The first simulation investigates the nonlinear setting, while the remaining 
simulations concentrate on the linear case. In all simulations we show re- 
sults with MASS applied using the LR classifier and the SVM classifier. In 
addition, we also apply LR and SVM to the original full dimensional data 



Table 3 

Test MCR* and corresponding p* or standard errors (in the parentheses) on leukemia 

cancer data using LR 







Lars 




GLMpath 


MASS (p = 16) 


m = 21 












SIS- 


0.147 (p* 


= 12,14-17,19,20) 


0.088 (p* 


= 16,17,20,22-24) 


0.176 (0.010) 


PCA- 


0.029 (p* 


= 13, 15-17) 


0.029 (p* 


= 9) 


0.056 (0.008) 


PCA-SIS- 


0.029 (p* 


= 16) 


0.029 (p* 


= 15) 


0.004 (0.002) 
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(FD) and to lower dimensional data produced using Lars, GLMpath, SIS 
and PCA. 

4.1. Simulation study I: Nonlinear data. In this setting we simulated 
data using the nonlinear additive model assumed by MASS. We first selected 
an NCS basis and then generated 0j,fc's such that (5) holds for a given A. 
We used the same generation scheme as that used by the MASS algorithm. 
Details of the generation process are provided in Appendix B. 

We simulated 20 training samples, each containing 100 observations and 
50 predictor variables generated from A/"(0, 1). The binary response variable 
was associated with the true Zo € K nxpo using a standard logistic regression 
model. Here we let po = 2. For each training data set a corresponding set of 
1000 observations was generated as a test sample. In order to demonstrate 
the importance of the sparsity of the transformation matrix to the classifica- 
tion performance, we also implemented a modified version of MASS where 
we fixed the sparsity of the transformation matrix, £, so that it did not 
change over iterations. We call this method the multivariate fixed stochastic 
search (MFSS) method. Presumably, MFSS with a good value of sparsity 
will perform better than MASS, because MASS has to adaptively search for 
£. For MASS and MFSS, we let p = p = 2 dimensions. 

We considered two scenarios, where the true curvatures were Ao = 5 and 
Ao = 10. According to Equation (5), A is one of the indicators of the model 
complexity. With a larger A, the model is more complex. The true sparsity 
for both scenarios were £o = 0.3. The counterpart methods are Lars and 
PCA with their optimal p*'s. We examined all possible values for p, that is, 
p = 1, 2, . . . , 50, for Lars and PCA and report their MCR* in each simulation 
run. Table 4 shows the average test MCR using LR and SVM classifiers. 
When we let A = Ao, MASS and MFSS constantly produced good results. 
In particular, MFSS with Ao was significantly better than other methods. 
MFSS with incorrect A's were inferior to MFSS with the correct A, but 
were still superior to Lars and PCA. Lars and PCA each suffered from an 
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Table 4 

Average test MCR (standard errors) in Simulation I 



A = 5,£o = 0.3 A = 10, ^o=0.3 



LR SVM LR SVM 



FD 0.424 (0.005) 0.414 (0.005) 0.412 (0.006) 0.389 (0.007) 

Lars (with p*) 0.398 (0.006) 0.391 (0.007) 0.407 (0.007) 0.405 (0.008) 

PCA (with p*) 0.424 (0.008) 0.426 (0.008) 0.487 (0.005) 0.489 (0.004) 

MASS (A = Ao) 0.234 (0.008) 0.239 (0.007) 0.302 (0.007) 0.300 (0.006) 

MFSS (A = 0,{ = 0.3) 0.289 (0.009) 0.284 (0.008) 0.352 (0.006) 0.351 (0.007) 

MFSS (A = 5, £ = 0.3) 0.222 (0.008) 0.230 (0.008) 0.315 (0.005) 0.309 (0.006) 

MFSS (A = 10, C = 0.3) 0.378 (0.007) 0.392 (0.008) 0.279 (0.006) 0.275 (0.005) 



inability to match the nonlinear structure of the data even with the optimal 
p* . When the model is very complex (e.g., A = 10), overfitting may occur 
in MASS even with a small number of iterations. We discuss the potential 
problem of overfitting in Section 5. 

4.2. Simulation study II: Sparse A case. This simulation was designed 
to examine the performance of MASS in a sparse model situation where the 
response was only associated with a handful of predictors. The training and 
test samples were generated from A/"(0,S), where the diagonal elements of 
£ were 1 and the off-diagonal elements were 0.5. We then rescaled the first 
Po = 5 columns to have a standard deviation of 10. We call these columns 
with the most variability the major columns, and the rest the minor columns. 

We tested two scenarios by creating two different true transformation 
matrices. In the first scenario the true transformation matrix was Ao = 
perm 1 ([I po , 0] T ), where perm 1 was the row permutation that made the unit 
row vectors in Ao associate with the major columns of the data matrix. In 
the second scenario the true transformation matrix was Ao = perm 2 ( [I po , 0] T ) , 
where perm 2 was the row permutation that made the unit row vectors in 
Ao associate with any pq of the minor columns. In both scenarios, the true 
dimension of the subspace is po = 5. Again, the group labels were generated 
by a standard logistic regression model: 

zT B 

(13) Pr( i/< = l|Z i ) = - 6 ' 

l + e z i p 

where Zi is the ith. point in the reduced space and (3 is a po-dimensional 
coefficients vector of the logistic regression. The elements of /3 are generated 
from some uniform distributions, that is, U{— 0.5,0.5) for scenario 1 and 
U{— 2, 2) for scenario 2, in order to make the Bayes error rates for both 
scenarios remain roughly around 0.1. We expect that PCA should perform 



16 



T. S. TIAN, G. M. JAMES AND R. R. WILCOX 



best in scenario 1 because this case matches the PCA working mechanism 
exactly; on the other hand, it should perform poorly in scenario 2 because 
the first p eigenvectors tended to concentrate on the major columns where 
no group information resides. 

We ran Lars, GLMpath, SIS and PCA for multiple values of p, and for 
Lars, GLMpath and SIS, p = 5 is the best value, p* . So we reported the 
average test error when p = 5 for these methods. We mandatorily assigned 
p = 5 to MASS and MFSS (with f = £ = 0.98). This was a disadvantage 
for these methods. We also calculated the average Bayes rates. As is shown 
in Table 5, not surprisingly, LR generally outperformed SVM on this data 
because the data were generated using a logistic regression model. As ex- 
pected, PCA worked well in scenario 1 but poorly in scenario 2. One of the 
reasons the performance of PCA is so poor in scenario 2 is that we used 
p = 5, not the p* . If the p* is used, the performance of PCA in scenario 2 
may be improved. The supervised methods had an advantage in the latter 
scenario because they always looked for A's with high predictive power. 
MASS and MFSS performed well in both scenarios. Note that since the true 
model was highly sparse, the variable selection methods (Lars and GLM- 
path) performed well too. As we can see from Figure 3, the average test 
MCR and deviance in both scenarios were decreasing, which means MASS 
tended to choose "good" directions from A* over iterations, and hence, the 
classification accuracy was improved gradually. 

It is interesting to see that the sparsity levels for the estimated projec- 
tion matrices of these two scenarios were different, although the sparsity for 
the true projection matrices were in fact the same: £o = 0.98. One possible 
reason for the difference is that while A's may have elements close to zero 
which make little contribution in extracting variables, these elements still 
contribute to its "denseness" according to our sparsity definition. Another 
possible reason is that for scenario 1, where all information was located in 
major columns, the noise level is low. The inclusion of minor columns does 
not highly influence classification accuracy. However, in scenario 2, since 
the noise level is high, the search must find the correct sparsity in order to 
improve accuracy. This explains why MFSS with £ = 0.98 performed worse 
than MASS in scenario 1 but better in scenario 2. 

4.3. Simulation study III: Dense A case. In this simulation we investi- 
gated a dense model. We simulated the input data X (the same size as in 
Simulation II) from AA(0,S), where the diagonal elements of £ were 1 and 
the off-diagonal elements were 0.5. Unlike the sparse case, the elements of 
the true 5-dimensional Ao were generated from the standard normal dis- 
tribution. Since the Ao matrix is dense, the columns in Z will be linear 
combinations of columns in X. If we use a linear LR model (13) to generate 
group labels as in Simulation II, the generated group labels will depend on 
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Table 5 

Bayes rates and average test MCR in Simulation II 



Scenario Classifier 


Bayes 


FD 


Lars 


GLMpath 


SIS 


PCA 


MASS 


MFSS 


1 LR 


0.114 


0.268 


0.150 


0.149 


0.192 


0.114 


0.130 


0.154 




(0.007) 


(0.011) 


(0.012) 


(0.012) 


(0.021) 


(0.009) 


(0.007) 


(0.008) 


SVM 




0.254 


0.166 


0.166 


0.213 


0.139 


0.136 


0.157 






(0.017) 


(0.013) 


(0.013) 


(0.022) 


(0.008) 


(0.006) 


(0.008) 


2 LR 


0.112 


0.273 


0.143 


0.148 


0.217 


0.477 


0.184 


0.141 




(0.006) 


(0.009) 


(0.009) 


(0.010) 


(0.017) 


(0.006) 


(0.012) 


(0.006) 


SVM 




0.255 


0.164 


0.169 


0.242 


0.485 


0.189 


0.152 






(0.018) 


(0.010) 


(0.011) 


(0.019) 


(0.007) 


(0.012) 


(0.006) 



The Average Training Deviance The Average Tesl Error Path on LDs The Average Sparsity 




100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 



Fig. 3. MASS performance graphs in Simulation II. Upper panels are scenario 1 and 
lower panels scenario 2. 

a linear combination of the columns of Z. Therefore, the group labels are 
actually directly related to a linear combination of all the columns of X, 
thus, removing the concept of a lower dimensional space. Hence, to ensure 
the response was a function of the lower dimensional space, we generated 
the group labels using a nonlinear logistic regression, 

(14) Pr (y4 = l|Z 4 ) = IT ^ y , 
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Table 6 

Average minimum MCR from Simulation III 







LR 




SVM 


P* 


MCR* 


P* 


MCR* 


Bayes 




0.082 (0.002) 




FD 




0.339 (0.006) 




0.317 (0.007) 


Lars 


21 


0.319 (0.006) 


.37 


0.306 (0.006) 


GLMpath 


1!) 


0.320 (0.006) 


2.-) 


0.308 (0.005) 


SIS 


25 


0.321 (0.005) 


43 


0.305 (0.006) 


PCA 


32 


0.329 (0.005) 


35 


0.326 (0,006) 


MASS(p = 5) 




0.271 (0.010) 




0.245 (0.008) 


MFSS(p = 5) 




0.239 (0.008) 




0.212 (0.007) 



where g(Z{) = sin(0.057rZj) T /3. This guarantees that most 0.057rZj values 
fall into the range of (— vr/2, vr/2). Hence, the nonlinearity is produced by 
that particular part of a sine function. 

Since the data scale was moderate and all the variables contain group 
information, the optimal p* for different methods may be different and not 
always be pq. We reported the average minimal MCR*'s and its correspond- 
ing p*'s for Lars, GLMpath, SIS and PCA, respectively. As to MASS and 
MFSS, we still fix p = 5, which is considered a disadvantage for these two 
but an advantage for other methods. Since Ao is dense, we assigned £ = 
to MFSS. Table 6 lists the average MCR* and p*'s based on 20 pairs of 
training and test data. Figure 4 shows the MASS performance graphs. As 
can be seen, MASS and MFSS are still superior when all the methods use 
their optimal p*'s, with MFSS providing the best results. Figure 5 shows the 
test MCR values by Lars, GLMpath, SIS and PCA with different p's. These 
methods all tend to use larger numbers of variables. 




Fig. 4. MASS performance graphs in Simulation III with p = 5. 
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Fig. 5. Average test MCR for all values of p in Simulation III. Upper panels: LR; lower 
panels: SVM. 

4.4. Simulation study IV: Contaminated data. In order to examine the 
robustness of MASS against outliers, we created a situation where the dis- 
tributional assumptions were violated. We generated the input data from 
a multivariate g-and-h distribution Field and Genton (2006) with g = h = 
(0.5, . . . ,0.5) T € M 50 . As with Simulation III, we used Equation (14) to cre- 
ate the group labels, except that here we used g(Zi) = sin(0.0057rZi) T /3. 
Since this is a highly skewed and heavy-tailed case, there are many extreme 
values. Hence, the data domain is rather large. We use sin(0.0057rZj) so that 
most 0.0057rZj values fall into the range of (— vr/2, 7r/2). As with Simulation 
III, only half period of a sine curve is effective. 

MASS and MFSS were performed with a fixed p = 5, while all other 
methods used their p*'s. MFSS was implemented with £ = 0. The results are 
listed in Table 7. As with Simulation III, MASS and MFSS with a fixed p = 5 
outperformed other methods with their p*'s. All other methods performed 
poorly on these data. Compared to Simulation III, the performance of MASS 
and MFSS did not decline as much as other methods. This demonstrates that 
the proposed method is not very sensitive to outliers. 

4.5. Simulation study V: Ultra high dimensional data. In this section 
we simulated an ultra high dimensional situation where n = 100, d = 1000. 
Thus, a preliminary dimension reduction was needed. Among the 1000 vari- 
ables, only 50 contained the group information. These 50 informative vari- 
ables were generated from a multivariate normal distribution with variance 1 
and correlation 0.5. The 950 noise variables were generated from A/"(0,0.5I) 
in order to achieve a reasonable signal-to- noise ratio. Letting po = 5, the Zq 
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was generated from the 50 informative variables through a dense Ao, with 
each element generated from the standard normal distribution. The group 
labels were still generated by Equation (14). 

We used the three aforementioned preliminary reduction methods, SIS, 
PCA and PCA-SIS, to reduce the data into m = 50 dimensions. Presumably, 
if the preliminary reduction can extract the m informative variables, MASS 



Table 7 

Average minimum MCR from Simulation IV 







LR 




SVM 


* 

P 


MCR* 


* 

p 


MCR* 


Bayes 




0.068 (0.002) 




FD 




0.482 (0.006) 




0.490 (0.007) 


Lars 


29 


0.403 (0.004) 


20 


0.391 (0.004) 


GLMpath 


37 


0.401 (0.005) 


48 


0.389 (0.006) 


SIS 


39 


0.410 (0.005) 


31 


0.392 (0.005) 


PCA 


35 


0.412 (0.005) 


31 


0.399 (0.005) 


MASS(p = 5) 




0.294 (0.006) 




0.285 (0.007) 


MFSS(p = 5) 




0.273 (0.005) 




0.266 (0.007) 


SIS 




PCA 




PCA-SIS 







Fig. 6. Average test MCR for all values of p on the first reduced data by Lars in Simu- 
lation V. Upper panels: using LR; lower panels: using SVM. 
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Table 8 

Average test MCR (standard errors) in Simulation V 





Classifier 


Lars (with p* ) 


GLMpath (with p*) 


MASS 


MFSS 


SIS- 


LR 


0.286 (0.012) 


0.288 (0.012) 


0.124 (0.007) 


0.150 (0.009) 




SVM 


0.250 (0.015) 


0.253 (0.015) 


0.121 (0.008) 


0.155 (0.008) 


PC A- 


LR 


0.348 (0.015) 


0.347 (0.013) 


0.165 (0.011) 


0.182 (0.011) 




SVM 


0.353 (0.014) 


0.356 (0.013) 


0.157 (0.011) 


0.179 (0.012) 


PCA-SIS- 


LR 


0.337 (0.013) 


0.335 (0.012) 


0.119 (0.010) 


0.125 (0.011) 




SVM 


0.325 (0.011) 


0.327 (0.010) 


0.102 (0.011) 


0.119 (0.010) 



and MFSS (with £ = 0) will perform well. We also implemented Lars and 
GLMpath on the preliminary reduced data. For Lars and GLMpath, we 
reported their average test MCR* and p*'s, while for MASS and MFSS we 
still fixed p = 5. Figure 6 shows the average test MCR from different values 
of p on the preliminarily reduced data using Lars as the further reduction 
method. The test MCR for GLMpath looks similar to Lars and thus is not 
shown. As we can see, no matter what preliminary reduction method is 
applied, Lars tends to choose larger values of p* than pq. However, even 
with the p*'s, the test MCR* are above 0.250. In particular, when PCA 
and PCA-SIS are applied, MCR* are above 0.320. SIS seems to be a better 
preliminary reduction method for Lars and GLMpath in this simulation. 

The average test MCR, using LR and SVM classifiers, are shown in Ta- 
ble 8. As we can see, all six implementations of MASS and MFSS were 
statistically significantly better than Lars and GLMpath with their p*'s. In 
particular, SIS-MASS and PCA-SIS-MASS generally provided the best re- 
sults. It is not surprising that PCA or PCA-SIS as the preliminary reduction 
methods did not fail, because the signal-to-noise ratio was not large enough 
in this study. This indicates that the improvement of the classification ac- 
curacy by MASS is not only caused by the preliminary reduction but more 
by the method itself. 

In order to further demonstrate that the improvement of the classification 
accuracy was not due mainly to the preliminary reduction but to the MASS 
method, we also applied SVM, kNN, NN and RF on the 50-dimensional 
reduced data without performing any further reduction. The results are 
shown in Table 9. As can be seen, MASS and MFSS performed extremely 
well relative to those approaches. 

5. Computational issues. In this section we discuss several issues associ- 
ated with MASS, including the solution variability, the computational issue 
and the potential problem of overfitting. 

Regarding the randomized nature of MASS, one issue is the variability of 
the solution, that is, the variability of the selected A for a fixed training and 
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test pair from different simulation runs. Theoretically, a space is determined 
by its orthogonal basis. We presume two spaces are the same as long as they 
have the same basis even though they can be rotated differently. Therefore, 
A's can be treated the same if the Z's they produce have the same princi- 
pal components (PC). Hence, we examine the first PC of Z's produced by 
different simulation runs. 

We examine three different cases. The first two are the two scenarios in 
Simulation II and the third is the fMRI data in Section 3.1. For each of the 
first two cases, we generate a training data set of 100 observations paired 
with a test data set of 1000 observations. In each paired data set, we apply 
MASS 100 times and obtain 100 Z's. We then extract the first PC for each 
Z on the test data and calculate the average absolute pairwise correlations 
between MASS runs: 



where p Sj t is the correlation of the first PC between the sth and the tth runs. 

For the fMRI data, we fixed the training data (150 observations) and 
test data (50 observations), and conducted a preliminary reduction using 
PCA-SIS to reduce the data space to 60 dimensional. The |ppci| was also 
calculated. In the first tw o cases, we set p = 5 and in the third, we set p = 50. 

Table 10 shows |ppci|> the proportion of variance that the first PC con- 
tains, and the standard errors of test MCR from 100 runs. In all cases, 
|/?PCi|'s are reasonably high (all above 0.70). Particularly, in the first case, 
|/Opci| = 0.968, which indicates the first PC accounting for 89% of the data 
variance, are highly consistent based on 100 simulation runs. PCs containing 
larger variance are more consistent than PCs with less variance. In addition, 
the SE for MCR are fairly small for all the cases (e.g., 0.006, 0.007 and 0.005, 
respectively). All these indicate the solution produced by MASS is fairly sta- 



We compared MASS on the fMRI data, in terms of both classification 
accuracy and the computational time, to the SA-Sparse method by Tian, 
Wilcox and James (2009), which is also a stochastic search procedure. The 
approach of Tian, Wilcox and James (2009) provides a useful comparison 



Average test MCR on the 50-dimensional first reduced data using different classifiers 




s<t 



blc. 



Table 9 



SVM 



kNN 



NN 



RF 



SIS- 0.349 (0.017) 0.356 (0.015) 0.365 (0.016) 0.344 (0.017) 

PCA- 0.358 (0.013) 0.377 (0.013) 0.323 (0.013) 0.350 (0.015) 

PCA-SIS- 0.359 (0.012) 0.364 (0.014) 0.318 (0.012) 0.351 (0.015) 
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because it also uses a stochastic search but implements the search process 
using a simulated annealing approach. The code was written in R and run 
on a Dell Precision workstation (CPU 3.00 GHz; RAM 2.99 GHz, 16.0 GB). 
We recorded the total CPU time (in seconds) of the R program for both 
methods. PCA was applied first to reduce the data dimension to m = 80 
and then both methods searched for a p = 30 dimensional subspace. Table 11 
reports the time consumed and the classification accuracy on the test data 
when both methods use 500 iterations. Both MASS and SA-Sparse took a 
similar time period to run. The key difference was that MASS converged 
well before 500 iterations, while SA-Sparse did not. Hence, the error rate for 
SA-Sparse was much higher. SA-Sparse took approximately 5000 iterations 
to converge, which resulted in an order of magnitude more computational 
effort. In addition, even when 5000 iterations were chosen for SA-Sparse, the 
average test errors were still higher than for the MASS method. 

Since the parameter space for MASS is large (d x L dimensional), over- 
fitting may be a potential problem. In our nonlinear simulation study, we 
observed that when the model was very flexible, that is, A was large, MFSS 
produced poor accuracy. Figure 7 shows the average test MCR paths over 
500 iterations in the same setting as Simulation I except with Ao = 1. As 
is displayed in the left panel of Figure 7, when we assign A = 5, which is a 
more flexible model than the true model, the MCR decreases rapidly at the 
beginning, then it starts increasing dramatically, and it levels off at some 
point of time. This trend clearly indicates the presence of overfitting. When 
we used the linear MASS (i.e., assigning A = 0), which was a less flexible 
model than the true model, overfitting was less likely to occur. As we can 



Table 10 

Check of solution stability: |ppci|'s (standard errors), proportion of variance (standard 
errors) and standard errors for MCR 





PCI 


Variance (%) 


SE for MCR 


Simulation II (1) 


0.968 (0.000) 


89% (0.004) 


0.006 


Simulation II (2) 


0.830 (0.001) 


74% (0.012) 


0.007 


fMRI 


0.732 (0.002) 


42% (0.013) 


0.005 




Table 


11 




Execution time for each 


run and the average test MCR on fMRI data using SVM 




CPU 


Time/Run (sec.) 


MCR 


MASS (500 iterations) 




55.34 


0.187 (0.010) 


SA-Sparse (500 iterations) 




44.70 


0.301 (0.018) 


SA-Sparse (5000 iterations) 




444.9275 


0.237 (0.012) 
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see from the right panel of Figure 7, the MCR decreases rapidly and levels 
off. In addition, we used a small value for p (the same as the true po). This 
made the model less flexible. Presumably when p 3> po, overfitting may also 
become an issue. 

6. Discussion. MASS was designed to implement a supervised learning 
classification method with the flexibility to mimic either a variable selection 
or a variable combination method. It does this by adaptively adjusting the 
sparsity of the transformation matrix used to lower the dimensionality of 
the original data space. We use a stochastic search procedure to address 
the very high dimensional predictor space. Our simulation results suggest 
that this approach can provide extremely competitive results relative to a 
large range of classical and modern classification techniques in both linear 
and nonlinear cases. We also examined three different preliminary dimension 
reduction methods which appeared to both increase prediction accuracy as 
well as improve computational efficiency MASS seems to converge quickly 
relative to other stochastic approaches, which makes it feasible to be applied 
to large data sets. The MASS method for dimensionality reduction could also 
be generalized to the context of regression where the response is a continuous 
variable. Further studies are planned for this setting. 

APPENDIX A: PROOF OF THEOREM 1 

Suppose the theorem is not true. Then for some 5 > it must be the case 
that as n and I converge to infinity, -||Z^ — Zo|| 2 > 5 happens infinitely 
often with a probability greater than 0. Without loss of generality, we can 
assume 5 < e where e is defined in Assumption 1. But since MASS randomly 
searches the entire space of Z, as I — >• oo, we are guaranteed at some stage to 
generate a candidate set of predictors, Zn \ such that -;||Z^ — Zq|| 2 < 5 < e. 




100 200 300 400 500 100 200 300 400 500 



Fig. 7. Test MCR paths in Simulation I. Left panel: MFSS with X = 5; right panel: MFSS 
with A = . 
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The last point to prove is that this candidate set is selected by MASS and 
does not then get rejected at a later iteration. 

By Assumption 1, there exists Q with P(£l) = 1 such that for any uj € 
VL, for all n > N(w), Z n (w) is guaranteed to be selected since -||Z n (u;) — 

Zo 1 1 2 < £. Once Z n (cj) is selected, at each future iteration the same set of 
predictors will be presented to the variable selection method along with other 
possible candidates. By Assumption 1, the only way that Z n (w) would not be 
selected at the next iteration would be if an even better set of predictors was 
generated with squared distance from the true predictors even lower. Hence, 
as n and I approach infinity, it must be the case that — ||Zn — Zo|| 2 < 5. 
Thus, the theorem is proved. 

APPENDIX B: DEDUCTION OF NONLINEAR REDUCTION 

We write fj,k(t) = b(t) T 9 jik = &i(i)0j,fe,i H h b q (t)9 jjk , q , where b(i) is 

a NCS basis with q degrees of freedom and 9j k is the coefficient vector. 
We need to generate 9j :k such that (5) holds. First note that f'j'1(t) = 
9j k h"(t)b"(t) T 6j t k- Then the integral in (5) becomes 



/• 



i=i 

where T = y Ya=i t>"(i/)b"(t/) T and t±, . . . ,tx represent a fine grid of time 
points. Applying the singular value decomposition, we can write T = UDU T , 
where U is orthogonal and D = diag(di, . . . ,d q -i,0). Note the in the sin- 
gular value decomposition comes from the slope term (set to zero when you 
take the second derivative) since there are no intercept terms in the basis 
function. 

We further write 

where 9* k = \J T 9j tk , 9~ k is the first q — 1 elements of 9* k , and D~ = diag(di, 
. . . , dq-i). Hence, we need to constrain 9~ k r ~D~9~ k < A. This is easily achieved 
by first generating the 9j jk s as described in (7) and (8), and computing the 
corresponding 9~ k . We then reset 9j k via 




and let9 jk = Ve*. k = V(9- k ,9* >k>q ) T . 
We write 

f jtk (t) = b(tfV(9- k ,9l ktq f = (b(tf\J)9T k + (b(t) T V) q 9* ik>q . 
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In particular, if A = 0, all elements in 6- k are zero, in which case the integral 

is also zero and a linear fit is produced, that is, fj,k{t) = (t>(t) T U)<?#* fc q - 

Since 6* k indicates the slope term, standardizing fj^ is equivalent to 
standardize all the 9* h 's. We let 

f* k (t) = (p(t) T v) q e* jtktq = vte* jtkiV 

where v 2 = j(h(t) T XJ) q . Standardizing fj^ involves setting 

d 
k=l 

Hence, we reset 6* k by 

a* 

a* , i.fc.g 

This approach ensures that (5) holds for all candidate functions. 
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